My student ID is 22202363.
In this assignment I will collect business data and user reviews from the Yelp Fusion API (https://www.yelp.com/developers/documentation/v3/get_started) for restaurants and bars in Dublin.
import pandas as pd
import numpy as np
#!pip install geopy
from geopy.geocoders import Nominatim
# Read the data file into a pandas data frame.
datapath = "raw.csv"
raw = pd.read_csv(datapath, index_col=0)
raw.head()
| | id | name | category | rating | reviews | price | zipcode | latitude | longitude | city | address1 | address2 | address3 | display_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fR-pJ6nUn1bjPuT6lS2bsQ | The Brazen Head | pubs | 4.0 | 739 | €€ | 8 | 53.344970 | -6.276330 | Dublin | 20 Bridge Street Lower | NaN | NaN | ['20 Bridge Street Lower', 'Dublin 8', 'Republ... |
| 1 | A-HzqcGJVTwHVFTVH_LlPA | The Temple Bar | pubs | 4.0 | 550 | €€ | 2 | 53.345500 | -6.264190 | Dublin | 47/48 Temple Bar | Temple Bar | NaN | ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',... |
| 2 | rKvPQZcgjrQOLRU0phPoAQ | Queen of Tarts | desserts | 4.5 | 511 | €€ | 2 | 53.344121 | -6.267529 | Dublin | Cork Hill | Dame Street | NaN | ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu... |
| 3 | _449xLONUU9nAUzCja2bNA | The Porterhouse Temple Bar | pubs | 4.0 | 369 | €€ | 2 | 53.345100 | -6.267550 | Dublin | 16-18 Parliament Street | NaN | NaN | ['16-18 Parliament Street', 'Dublin 2', 'Repub... |
| 4 | -VIve-QeHR9-cKr7QldqtA | Elephant & Castle | tradamerican | 4.0 | 345 | €€ | 2 | 53.345600 | -6.262470 | Dublin | 18 Temple Bar | NaN | NaN | ['18 Temple Bar', 'Dublin 2', 'Republic of Ire... |
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1734 entries, 0 to 1733
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               1734 non-null   object
 1   name             1734 non-null   object
 2   category         1734 non-null   object
 3   rating           1734 non-null   float64
 4   reviews          1734 non-null   int64
 5   price            1425 non-null   object
 6   zipcode          1604 non-null   object
 7   latitude         1732 non-null   float64
 8   longitude        1732 non-null   float64
 9   city             1734 non-null   object
 10  address1         1723 non-null   object
 11  address2         487 non-null    object
 12  address3         23 non-null     object
 13  display_address  1734 non-null   object
dtypes: float64(3), int64(1), object(10)
memory usage: 203.2+ KB
data1 = raw.copy()  # work on a copy so that cleaning does not mutate the raw frame
data1.drop_duplicates(inplace = True)
data1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1539 entries, 0 to 1733
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               1539 non-null   object
 1   name             1539 non-null   object
 2   category         1539 non-null   object
 3   rating           1539 non-null   float64
 4   reviews          1539 non-null   int64
 5   price            1248 non-null   object
 6   zipcode          1417 non-null   object
 7   latitude         1537 non-null   float64
 8   longitude        1537 non-null   float64
 9   city             1539 non-null   object
 10  address1         1528 non-null   object
 11  address2         428 non-null    object
 12  address3         21 non-null     object
 13  display_address  1539 non-null   object
dtypes: float64(3), int64(1), object(10)
memory usage: 180.4+ KB
# convert '€' to 1, '€€' to 2, '€€€' to 3, '€€€€' to 4; missing prices stay NaN
# (this also removes the temporary '-' placeholder round trip and fixes the
# undefined variable in the original except blocks, which printed `n` instead of `ind`)
data2 = data1.copy()
for ind in data2.index:
    price = data2.loc[ind, 'price']
    if pd.isna(price):
        continue
    data2.loc[ind, 'price'] = len(price)
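The loop above can also be written as one vectorized expression. A minimal sketch (a toy `prices` Series standing in for `data2['price']`), relying on the fact that each price tier is just a run of '€' characters:

```python
import pandas as pd

# each price tier is a run of '€' characters, so its string length is the tier number;
# .str.len() maps missing values to NaN automatically, no '-' placeholder needed
prices = pd.Series(['€', '€€', '€€€', None, '€€€€'])
tiers = prices.str.len()
print(tiers.tolist())  # [1.0, 2.0, 3.0, nan, 4.0]
```

On the real frame this would be a single assignment, `data2['price'] = data2['price'].str.len()`, instead of a row-by-row loop.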
data2[data2['latitude'].isnull()]
| | id | name | category | rating | reviews | price | zipcode | latitude | longitude | city | address1 | address2 | address3 | display_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1267 | 34Sqd8zCW705lJTK4-YaJA | The Back Page | pubs | 4.5 | 23 | 1 | NaN | NaN | NaN | Dublin | 199 Phibsboro Road | Phibsboro | NaN | ['199 Phibsboro Road', 'Phibsboro', 'Dublin', ... |
| 1586 | FEuyJ_bhika-QAU9DW78pg | coppers face jack | irish_pubs | 4.0 | 1 | NaN | NaN | NaN | NaN | Dublin | NaN | NaN | NaN | ['Dublin', 'Republic of Ireland'] |
# row 1267 has an address1 that can be used to look up its coordinates;
geolocator = Nominatim(user_agent="my_request")
location = geolocator.geocode("199 Phibsboro Road,Phibsboro,Dublin,Republic of Ireland")
print('latitude = {}, longitude = {}'.format(location.latitude, location.longitude))
data2.loc[1267,'latitude']=location.latitude
data2.loc[1267,'longitude']=location.longitude
latitude = 53.3635526, longitude = -6.272288
# row 1586 has no address information at all and only 1 review, so I decided to delete it
data2.drop(index=1586,inplace = True)
data2
| | id | name | category | rating | reviews | price | zipcode | latitude | longitude | city | address1 | address2 | address3 | display_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fR-pJ6nUn1bjPuT6lS2bsQ | The Brazen Head | pubs | 4.0 | 739 | 2 | 8 | 53.344970 | -6.276330 | Dublin | 20 Bridge Street Lower | NaN | NaN | ['20 Bridge Street Lower', 'Dublin 8', 'Republ... |
| 1 | A-HzqcGJVTwHVFTVH_LlPA | The Temple Bar | pubs | 4.0 | 550 | 2 | 2 | 53.345500 | -6.264190 | Dublin | 47/48 Temple Bar | Temple Bar | NaN | ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',... |
| 2 | rKvPQZcgjrQOLRU0phPoAQ | Queen of Tarts | desserts | 4.5 | 511 | 2 | 2 | 53.344121 | -6.267529 | Dublin | Cork Hill | Dame Street | NaN | ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu... |
| 3 | _449xLONUU9nAUzCja2bNA | The Porterhouse Temple Bar | pubs | 4.0 | 369 | 2 | 2 | 53.345100 | -6.267550 | Dublin | 16-18 Parliament Street | NaN | NaN | ['16-18 Parliament Street', 'Dublin 2', 'Repub... |
| 4 | -VIve-QeHR9-cKr7QldqtA | Elephant & Castle | tradamerican | 4.0 | 345 | 2 | 2 | 53.345600 | -6.262470 | Dublin | 18 Temple Bar | NaN | NaN | ['18 Temple Bar', 'Dublin 2', 'Republic of Ire... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1729 | TVjY8fNyPccD2nNnKULtlw | Bradys Bar | pubs | 3.0 | 1 | 3 | 6 | 53.309898 | -6.283949 | Dublin | 5-9 Terenure Place | NaN | NaN | ['5-9 Terenure Place', 'Dublin 6', 'Republic o... |
| 1730 | nAj5_-n6auPSWYHGsgL3_g | Voici Creperie and Wine Bar | creperies | 1.0 | 1 | NaN | 6 | 53.321891 | -6.266197 | Rathmines | 1A Rathgar Road | NaN | NaN | ['1A Rathgar Road', 'Rathmines, 6', 'Republic ... |
| 1731 | MoTW58ukvPLNhf98U5i2hA | Grangers | pubs | 5.0 | 1 | NaN | 5 | 53.392990 | -6.214760 | Coolock | NaN | NaN | NaN | ['Coolock, 5', 'Republic of Ireland'] |
| 1732 | HkeaGMvNDqtiZ76ttjd8Kw | Martins Lounge | pubs | 2.0 | 1 | NaN | 11 | 53.390630 | -6.288100 | Dublin | 122 Ballygall Road W | NaN | NaN | ['122 Ballygall Road W', 'Dublin 11', 'Republi... |
| 1733 | mKSywJhByCWlDIG00K6byw | Comhaltas Ceoltóirí Éireann | recording_studios | 5.0 | 1 | NaN | NaN | 53.289089 | -6.207780 | Stillorgan | Cultúrlann na hÉireann, 32 Belgrave Square | Monkstown | NaN | ['Cultúrlann na hÉireann, 32 Belgrave Square',... |
1538 rows × 14 columns
data2[data2['zipcode'].isnull()]
| | id | name | category | rating | reviews | price | zipcode | latitude | longitude | city | address1 | address2 | address3 | display_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | jGWPezN-TLd8oQ9etopUqA | Cafe Azteca | mexican | 4.5 | 144 | 2 | NaN | 53.349805 | -6.260310 | Dublin | NaN | NaN | NaN | ['Dublin', 'Republic of Ireland'] |
| 150 | FSM9ID_UV9j0l4VL1YLkQw | Goose On The Loose | cafes | 4.5 | 65 | 1 | NaN | 53.337524 | -6.266099 | Dublin | 2 Kevin Street | NaN | NaN | ['2 Kevin Street', 'Dublin', 'Republic of Irel... |
| 234 | t7uffDe-mgo2b4_4TR4OdQ | Bow Lane | cocktailbars | 4.0 | 48 | 2 | NaN | 53.340270 | -6.265540 | Dublin | 17 Aungier Street | NaN | NaN | ['17 Aungier Street', 'Dublin', 'Republic of I... |
| 280 | sD87aoH4VezCI9ndUlx4pw | PÓG | salad | 4.0 | 41 | 1 | NaN | 53.347415 | -6.259912 | Dublin | 32 Bachelors Walk | Dublin 1 | NaN | ['32 Bachelors Walk', 'Dublin 1', 'Dublin', 'R... |
| 301 | ZU3U1dx6-j9q6xvgpK21OQ | Mykonos Taverna | greek | 4.0 | 38 | 2 | NaN | 53.344299 | -6.266500 | Dublin | 76 Dame Street | NaN | NaN | ['76 Dame Street', 'Dublin', 'Republic of Irel... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1712 | sgClthFJpp4XD7nGg1ZUUg | D'Arcy McGees | pubs | 4.0 | 4 | 2 | NaN | 53.299407 | -6.309041 | Templeogue | Spawell Leisure Complex | NaN | NaN | ['Spawell Leisure Complex', 'Templeogue, Co. D... |
| 1717 | y4nHvXDaChGCyGcNU_k97w | Ciamei Cafe | cafes | 5.0 | 3 | 1 | NaN | 53.301400 | -6.177730 | Blackrock | The Blackrock Market | 19A Main Street | NaN | ['The Blackrock Market', '19A Main Street', 'B... |
| 1721 | AI623UyrkbjZSZS7RfNwJg | Brasserie @ Dublin Airport | irish | 2.0 | 2 | NaN | NaN | 53.396350 | -6.128800 | Dublin | The Street | Dublin Airport | NaN | ['The Street', 'Dublin Airport', 'Dublin', 'Re... |
| 1726 | kPxI4GAh769V1Xzlnx7ZDQ | Brass Bar & Grill | irish | 4.0 | 1 | NaN | NaN | 53.288170 | -6.196170 | Stillorgan | Stillorgan Road | NaN | NaN | ['Stillorgan Road', 'Stillorgan, Co. Dublin', ... |
| 1733 | mKSywJhByCWlDIG00K6byw | Comhaltas Ceoltóirí Éireann | recording_studios | 5.0 | 1 | NaN | NaN | 53.289089 | -6.207780 | Stillorgan | Cultúrlann na hÉireann, 32 Belgrave Square | Monkstown | NaN | ['Cultúrlann na hÉireann, 32 Belgrave Square',... |
121 rows × 14 columns
# Use the coordinates to reverse-geocode the missing zipcodes
import geopy
def get_zipcode(df, geolocator, lat_field, lon_field):
location = geolocator.reverse((df[lat_field], df[lon_field]))
try:
return location.raw['address']['postcode']
except KeyError:
return 'postcode not found'
#return location.raw['address']['postcode']
geolocator = geopy.Nominatim(user_agent='user_agents')
df = data2.loc[:,['latitude','longitude']]
#df.info()
zipcodes = df.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='latitude', lon_field='longitude')
data2['postcode']=zipcodes
data2.loc[data2['postcode']=='postcode not found']
data2['area']= data2['postcode'].str.slice(0, 3)
data2['area']
0 D08
1 D01
2 D01
3 D01
4 D01
...
1729 D06
1730 D06
1731 D05
1732 D11
1733 A94
Name: area, Length: 1538, dtype: object
#there is one row that has no zipcode, no price and only 1 review, so I decided to drop it
data3 = data2.drop(index=1536)
data3.loc[data3['category'].str.contains('bar'),['category']] = 'bar'
data3.loc[data3['category'].str.contains('bistros'),['category']] = 'bar'
data3.loc[data3['category'].str.contains('pub'),['category']] = 'pub'
data3.loc[data3['category'].str.contains('creperies'),['category']] = 'dessert'
data3.loc[data3['category'].str.contains('dessert'),['category']] = 'dessert'
data3.loc[data3['category'].str.contains('dimsum'),['category']] = 'dessert'
data3.loc[data3['category'].str.contains('donuts'),['category']] = 'dessert'
data3.loc[data3['category'].str.contains('bagels'),['category']] = 'bakery'
data3.loc[data3['category'].str.contains('bakeries'),['category']] = 'bakery'
data3.loc[data3['category'].str.contains('cakeshop'),['category']] = 'bakery'
data3.loc[data3['category'].str.contains('ramen'),['category']] = 'japanese'
data3.loc[data3['category'].str.contains('sushi'),['category']] = 'japanese'
data3.loc[data3['category'].str.contains('szechuan'),['category']] = 'chinese'
data3.loc[data3['category'].str.contains('delicatessen'),['category']] = 'delis'
data3.loc[data3['category'].str.contains('tapasmallplates'),['category']] = 'tapas'
data3.loc[data3['category'].str.contains('vegetarian'),['category']] = 'vegan'
data3.loc[data3['category'].str.contains('coffee'),['category']] = 'cafe'
data3.loc[data3['category'].str.contains('cafe'),['category']] = 'cafe'
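The chain of `str.contains` replacements above could also be collapsed into a single lookup table. A sketch with a hypothetical subset of the mappings — note it assumes exact category strings (as in this dataset), unlike the substring matching above:

```python
import pandas as pd

# hypothetical subset of the category → canonical-label table used above
CATEGORY_MAP = {
    'creperies': 'dessert', 'donuts': 'dessert', 'dimsum': 'dessert',
    'bagels': 'bakery', 'bakeries': 'bakery', 'cakeshop': 'bakery',
    'ramen': 'japanese', 'sushi': 'japanese', 'szechuan': 'chinese',
}

cats = pd.Series(['creperies', 'sushi', 'pizza'])
# map() translates known categories; fillna() keeps unmapped ones unchanged
normalised = cats.map(CATEGORY_MAP).fillna(cats)
print(normalised.tolist())  # ['dessert', 'japanese', 'pizza']
```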
Reverse geocoding with geopy takes a long time, so I save the file to avoid recomputing it.
data3.to_csv('preprocessed.csv', sep=',', header=True, index=True,float_format = str)
import pandas as pd
import numpy as np
datapath = "preprocessed.csv"
data = pd.read_csv(datapath, index_col=0)
data.head()
| | id | name | category | rating | reviews | price | zipcode | latitude | longitude | city | address1 | address2 | address3 | display_address | postcode | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fR-pJ6nUn1bjPuT6lS2bsQ | The Brazen Head | pub | 4.0 | 739 | 2.0 | 8 | 53.344970 | -6.276330 | Dublin | 20 Bridge Street Lower | NaN | NaN | ['20 Bridge Street Lower', 'Dublin 8', 'Republ... | D08 WC64 | D08 |
| 1 | A-HzqcGJVTwHVFTVH_LlPA | The Temple Bar | pub | 4.0 | 550 | 2.0 | 2 | 53.345500 | -6.264190 | Dublin | 47/48 Temple Bar | Temple Bar | NaN | ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',... | D01 E8P4 | D01 |
| 2 | rKvPQZcgjrQOLRU0phPoAQ | Queen of Tarts | dessert | 4.5 | 511 | 2.0 | 2 | 53.344121 | -6.267529 | Dublin | Cork Hill | Dame Street | NaN | ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu... | D01 E8P4 | D01 |
| 3 | _449xLONUU9nAUzCja2bNA | The Porterhouse Temple Bar | pub | 4.0 | 369 | 2.0 | 2 | 53.345100 | -6.267550 | Dublin | 16-18 Parliament Street | NaN | NaN | ['16-18 Parliament Street', 'Dublin 2', 'Repub... | D01 E8P4 | D01 |
| 4 | -VIve-QeHR9-cKr7QldqtA | Elephant & Castle | tradamerican | 4.0 | 345 | 2.0 | 2 | 53.345600 | -6.262470 | Dublin | 18 Temple Bar | NaN | NaN | ['18 Temple Bar', 'Dublin 2', 'Republic of Ire... | D01 E8P4 | D01 |
1) I retrieved 1,537 restaurants & bars in Dublin from the Yelp API;
2) Their average price level is 1.92, i.e. a medium level close to '€€' (the highest tier is '€€€€');
3) The average user rating is 3.79 out of a full score of 5;
4) Of the 1,537 businesses, 595 are bars/pubs and 942 are restaurants, though most bars also serve food like restaurants;
5) Bars and restaurants are similar in user rating and price.
overall = pd.DataFrame()
overall.loc[0,'City']='Dublin'
overall['Num of restaurants&bars'] = len(data)
overall['Avg price of res&bar'] = round(data.price.mean(),2)
overall['Avg review cnt of res&bar'] = round(data.reviews.mean(),2)
overall['Avg score of res&bar'] = round(data.rating.mean(),2)
overall['Num of restaurants'] = len(data.loc[(data['category']!='bar') &(data['category']!='pub')])
overall['Avg price of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].price.mean(),2)
overall['Avg review cnt of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].reviews.mean(),2)
overall['Avg score of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].rating.mean(),2)
overall['Num of bars'] = len(data.loc[(data['category']=='bar') | (data['category']=='pub')])
overall['Avg price of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].price.mean(),2)
overall['Avg review cnt of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].reviews.mean(),2)
overall['Avg score of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].rating.mean(),2)
overall
| | City | Num of restaurants&bars | Avg price of res&bar | Avg review cnt of res&bar | Avg score of res&bar | Num of restaurants | Avg price of res | Avg review cnt of res | Avg score of res | Num of bars | Avg price of bars | Avg review cnt of bars | Avg score of bars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dublin | 1537 | 1.92 | 30.06 | 3.79 | 942 | 1.91 | 32.76 | 3.8 | 595 | 1.95 | 25.77 | 3.78 |
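The `overall` table above repeats the same restaurant/bar filter in every column; the mask can be computed once and reused. A sketch on toy data with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['pub', 'bar', 'cafe', 'irish'],
    'price':    [2, 3, 1, 2],
    'reviews':  [100, 50, 80, 20],
    'rating':   [4.0, 3.5, 4.5, 4.0],
})

is_bar = df['category'].isin(['bar', 'pub'])  # one reusable mask

def summarise(sub):
    """Compute the count and rounded averages used in the overall table."""
    return {'n': len(sub),
            'avg_price': round(sub['price'].mean(), 2),
            'avg_rating': round(sub['rating'].mean(), 2)}

print(summarise(df[is_bar]))   # bars & pubs
print(summarise(df[~is_bar]))  # everything else (restaurants)
```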
areas = ['D01','D02','D03','D04','D05','D06','D07','D08','D09','D10','D11','D12','D13','D14','D15','D16',
'D18','D20','D24','A94']
description = []
for area in areas:
row = {"Areas": area}
row['amount'] = len(data.loc[(data['area']==area)])
row['avg_rating'] = round(data.loc[(data['area']==area),'rating'].mean(),2)
row['avg_reviews'] = round(data.loc[(data['area']==area),'reviews'].mean(),2)
row['avg_price'] = round(data.loc[(data['area']==area),'price'].mean(),2)
row['amount_res'] = len(data.loc[(data['category']!='bar') &(data['category']!='pub')& (data['area']==area)])
row['avg_rating_res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')& (data['area']==area),'rating'].mean(),2)
row['avg_reviews_res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')& (data['area']==area),'reviews'].mean(),2)
row['avg_price_res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')& (data['area']==area),'price'].mean(),2)
row['amount_bar'] = len(data.loc[((data['category']=='bar') | (data['category']=='pub'))& (data['area']==area)])
row['avg_rating_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'rating'].mean(),2)
row['avg_reviews_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'reviews'].mean(),2)
row['avg_price_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'price'].mean(),2)
description.append(row)
description = pd.DataFrame(description).set_index("Areas")
description
| | amount | avg_rating | avg_reviews | avg_price | amount_res | avg_rating_res | avg_reviews_res | avg_price_res | amount_bar | avg_rating_bar | avg_reviews_bar | avg_price_bar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Areas | ||||||||||||
| D01 | 387 | 3.77 | 33.74 | 1.83 | 266 | 3.74 | 34.25 | 1.80 | 121 | 3.84 | 32.62 | 1.94 |
| D02 | 531 | 3.83 | 41.79 | 1.98 | 353 | 3.84 | 42.31 | 1.96 | 178 | 3.81 | 40.75 | 2.03 |
| D03 | 36 | 3.61 | 10.36 | 1.77 | 18 | 3.50 | 12.67 | 1.88 | 18 | 3.72 | 8.06 | 1.64 |
| D04 | 127 | 3.80 | 19.31 | 2.14 | 80 | 3.74 | 21.15 | 2.19 | 47 | 3.89 | 16.19 | 2.05 |
| D05 | 13 | 3.31 | 4.38 | 1.62 | 4 | 3.38 | 6.00 | 1.75 | 9 | 3.28 | 3.67 | 1.50 |
| D06 | 110 | 3.80 | 17.64 | 2.02 | 72 | 3.85 | 20.71 | 2.02 | 38 | 3.70 | 11.82 | 2.03 |
| D07 | 92 | 3.79 | 20.48 | 1.68 | 50 | 3.79 | 23.76 | 1.58 | 42 | 3.79 | 16.57 | 1.83 |
| D08 | 104 | 4.03 | 29.65 | 1.73 | 57 | 4.00 | 28.88 | 1.64 | 47 | 4.06 | 30.60 | 1.87 |
| D09 | 37 | 3.81 | 11.70 | 2.00 | 18 | 3.83 | 12.78 | 2.25 | 19 | 3.79 | 10.68 | 1.73 |
| D10 | 3 | 3.00 | 1.33 | NaN | 0 | NaN | NaN | NaN | 3 | 3.00 | 1.33 | NaN |
| D11 | 8 | 3.44 | 2.00 | 2.00 | 0 | NaN | NaN | NaN | 8 | 3.44 | 2.00 | 2.00 |
| D12 | 17 | 3.62 | 4.29 | 1.92 | 3 | 3.83 | 10.00 | 1.67 | 14 | 3.57 | 3.07 | 2.00 |
| D13 | 5 | 2.70 | 23.40 | 1.50 | 2 | 2.75 | 57.00 | 2.00 | 3 | 2.67 | 1.00 | 1.00 |
| D14 | 16 | 3.44 | 6.69 | 2.07 | 3 | 3.33 | 5.00 | 2.00 | 13 | 3.46 | 7.08 | 2.09 |
| D15 | 3 | 3.33 | 4.33 | 2.00 | 0 | NaN | NaN | NaN | 3 | 3.33 | 4.33 | 2.00 |
| D16 | 9 | 3.44 | 6.44 | 2.29 | 5 | 3.10 | 6.60 | 2.50 | 4 | 3.88 | 6.25 | 2.00 |
| D18 | 2 | 4.00 | 8.00 | 2.00 | 1 | 5.00 | 15.00 | 2.00 | 1 | 3.00 | 1.00 | NaN |
| D20 | 4 | 3.38 | 9.75 | 1.75 | 1 | 4.00 | 30.00 | 3.00 | 3 | 3.17 | 3.00 | 1.33 |
| D24 | 5 | 2.50 | 3.40 | 1.50 | 1 | 1.00 | 1.00 | NaN | 4 | 2.88 | 4.00 | 1.50 |
| A94 | 27 | 3.94 | 9.26 | 1.84 | 8 | 4.12 | 9.88 | 1.80 | 19 | 3.87 | 9.00 | 1.86 |
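The per-area loop that builds `description` can likewise be replaced by one `groupby` pass with named aggregations; a sketch on toy rows using the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'area':    ['D01', 'D01', 'D02', 'D02'],
    'rating':  [4.0, 3.0, 5.0, 4.0],
    'reviews': [10, 20, 5, 15],
    'price':   [2, 1, 3, 2],
})

# one aggregation pass instead of a Python loop over area codes
by_area = df.groupby('area').agg(
    amount=('rating', 'size'),
    avg_rating=('rating', 'mean'),
    avg_reviews=('reviews', 'mean'),
    avg_price=('price', 'mean'),
).round(2)
print(by_area)
```

The restaurant/bar split could be added by grouping on `['area', is_bar_mask]` instead of `'area'` alone.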
· In the map below, we can see that the geographical distribution spreads from the city centre out to the suburbs along the city's main roads;
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go  # plotly's graph_objects module for map figures
# draw the businesses as a scatter layer on the map
scatter = go.Scattermapbox(lat=data['latitude'],
                           lon=data['longitude'],
                           hovertext=data['name'],
                           hoverinfo='text')
fig = go.Figure(scatter)  # add the scatter layer to a figure
fig.update_layout(mapbox_style='open-street-map')  # use OpenStreetMap as the base map
# free map styles: "open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner", "stamen-watercolor"
fig.show()
· Restaurants and bars are mainly concentrated in districts 1 and 2 in the city centre;
· There are also some in Dublin 4, 6, 7 and 8, but few in the other districts;
· The average price is highest in D04 and lowest in D07 & D08;
· D08: user ratings are higher in D08 than in the other districts, so D08 may have restaurants that are both cheap and tasty
popular_district = ['D01','D02','D04','D06','D07','D08']
m1 = description.loc[popular_district,['amount','avg_rating','avg_price']]
m1[['avg_rating','avg_price']].plot(kind='bar')
m1['amount'].plot(colormap='Purples_r',kind='line',secondary_y=True)
ax = plt.gca()
ax.set_xticklabels(popular_district)
plt.show()
plt.figure(figsize=(10,5))
grouped = description.amount.sort_values(ascending=False)[:10]
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("GnBu_r", len(grouped)))  # pass x/y as keywords
plt.xlabel('Area', labelpad=10, fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Count of Restaurants by Area (Top 10)', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for i, v in enumerate(grouped):
plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)
plt.figure(figsize=(10,5))
grouped = description.loc[description['amount']>10,'avg_price'].sort_values(ascending=False)
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("GnBu_r", len(grouped)))  # pass x/y as keywords
plt.xlabel('Area', labelpad=10, fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price of Restaurants & Bars in different areas', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for i, v in enumerate(grouped):
plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)
The price for most restaurants in Dublin is '€€' or '€', i.e. a low to medium level.
plt.figure(figsize=(10,5))
grouped = data.price.value_counts().sort_index()
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("RdBu_r", len(grouped)))  # pass x/y as keywords
plt.xlabel('price', labelpad=10, fontsize=14)
plt.ylabel('Count of restaurants', fontsize=14)
plt.title('Count of Restaurants against prices', fontsize=15)
plt.tick_params(labelsize=14)
for i, v in enumerate(grouped):
plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)
· Apart from bars & pubs, the most numerous restaurant types in Dublin are: cafe, irish, italian, pizza, indpak, ...
· In addition to Western restaurants, there are also many East Asian restaurants, such as chinese, japanese and thai. This suggests Dublin is a city with a fairly diverse food culture.
plt.figure(figsize=(10,5))
grouped = data.category.value_counts().sort_values(ascending=False)[:12]
sns.countplot(y='category',data=data,
order = grouped.index, palette= sns.color_palette("RdBu_r", len(grouped)))
plt.xlabel('Count', labelpad=21, fontsize=14)
plt.ylabel('Category', fontsize=14)
plt.title('Count of Restaurants & Bars by Category (Top 12)', fontsize=15)
for i, v in enumerate(grouped):
    # countplot is horizontal here, so the count v is the x position and i the y position
    plt.text(v*1.01, i, str(v), verticalalignment='center', fontweight='bold', fontsize=14)
First, we look at the businesses with the largest numbers of reviews, i.e. the places many people have tried.
It is easy to see that although some restaurants have many reviews, some of those reviews may be negative.
Restaurants with such low user ratings are not what I need.
top_reviewed = data[['name','reviews','rating']].sort_values(by='reviews', ascending=False)[:10]
top_reviewed
| | name | reviews | rating |
|---|---|---|---|
| 0 | The Brazen Head | 739 | 4.0 |
| 1 | The Temple Bar | 550 | 4.0 |
| 2 | Queen of Tarts | 511 | 4.5 |
| 3 | The Porterhouse Temple Bar | 369 | 4.0 |
| 952 | The Bank on College Green | 369 | 4.5 |
| 4 | Elephant & Castle | 345 | 4.0 |
| 5 | Cornucopia | 334 | 4.5 |
| 6 | Brother Hubbard | 329 | 4.5 |
| 954 | The Hairy Lemon | 324 | 4.0 |
| 7 | O'Neills Bar & Restaurant | 281 | 3.5 |
plt.figure(figsize=(11,6))
grouped = data[['name','reviews']].sort_values(by='reviews', ascending=False)[:10]
sns.barplot(x=grouped.reviews, y = grouped.name, palette=sns.color_palette("GnBu_r", len(grouped)), ci=None)
plt.xlabel('Count of Review', labelpad=10, fontsize=14)
plt.ylabel('Restaurants', fontsize=14)
plt.title('Top 10 Restaurants with Most Reviews', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for i, v in enumerate(grouped.reviews):
plt.text(v, i, str(v), fontweight='bold', fontsize=14)
So when we choose a restaurant, we need to consider both the review count and the user rating.
1) Select businesses that have at least 100 reviews and a user rating >= 4.5, rank them by review count, and add the top 10 to my wishlist.
2) Many reviews means more people have tried the restaurant before, which makes its user rating more credible.
3) The `top_bussiness` table below is what I plan to taste in the future.
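One common way to fold review count and rating into a single score (my own illustrative addition, not part of the assignment's selection rule) is an IMDb-style weighted average that shrinks low-review ratings toward the global mean. Here `C` reuses the 3.79 average rating from the summary table, and `m` is a hypothetical minimum-review threshold:

```python
import pandas as pd

df = pd.DataFrame({'name':    ['A', 'B'],
                   'rating':  [5.0, 4.5],
                   'reviews': [3, 300]})

C = 3.79  # global mean rating across all Dublin businesses (from the summary above)
m = 50    # hypothetical minimum-review threshold

# weighted rating: with few reviews the score is pulled toward C,
# with many reviews it approaches the restaurant's own rating
df['weighted'] = (df['reviews'] * df['rating'] + m * C) / (df['reviews'] + m)
print(df[['name', 'weighted']].round(2))
```

With these numbers, B's 4.5 from 300 reviews now outranks A's perfect 5.0 from only 3 reviews.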
top_bussiness = data.loc[data['reviews']>=100, ['id','name','category','price','rating','reviews']].sort_values(by=['rating','reviews'], ascending=False)[:10]
top_bussiness_id = top_bussiness.id.tolist()
top_bussiness
| | id | name | category | price | rating | reviews |
|---|---|---|---|---|---|---|
| 70 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | salad | 1.0 | 5.0 | 100 |
| 2 | rKvPQZcgjrQOLRU0phPoAQ | Queen of Tarts | dessert | 2.0 | 4.5 | 511 |
| 952 | dlClCiMV4Y8yTc9vCUCABw | The Bank on College Green | pub | 2.0 | 4.5 | 369 |
| 5 | ZdZNRZ1OdQ1MYfaK0vsbNw | Cornucopia | vegan | 2.0 | 4.5 | 334 |
| 6 | DM0Tcka4QpP4YqCfJ5nL1g | Brother Hubbard | mideastern | 2.0 | 4.5 | 329 |
| 10 | LG37RcSre8vSlS-5uJE2DA | The Bakehouse | bakery | 1.0 | 4.5 | 261 |
| 956 | bwhASCB14C2mlmctXcsKtA | The Stag's Head | pub | 2.0 | 4.5 | 259 |
| 13 | iNk7KmI1j-tfPGSNs6RXvg | The Pig's Ear | bar | 3.0 | 4.5 | 223 |
| 18 | cOpu16xeZUJNnhhUs71MJA | L Mulligan Grocer | pub | 2.0 | 4.5 | 199 |
| 957 | gVVBwMK1bd53VvT51XtPVQ | Vintage Cocktail Club V.C.C | bar | 3.0 | 4.5 | 190 |
import json, requests
key = 'akX_q_pdbj6rh2xLXX35DTVNvqdD3T3F9gopk7Zqtp98hZF0gFEzGAdkZZEDwUU9o0iLsq0tOJN99o9eknmi8SvB1hz2FmFsQBzT9Oq9lVyLxyvACu72aem38OBWY3Yx'
headers = {'Authorization': 'bearer %s' % key}
reviewdata = {'review_id':[],'user_id':[],'business_id':[],'text':[],'datetime':[]}
for bussiness_id in top_bussiness_id:
url = 'https://api.yelp.com/v3/businesses/'+bussiness_id+'/reviews'
response = requests.get(url,headers=headers)
#print(response.json())
query = response.json()['reviews']
for q in query:
reviewdata['review_id'].append(q['id'])
reviewdata['text'].append(q['text'])
reviewdata['user_id'].append(q['user']['id'])
reviewdata['business_id'].append(bussiness_id)
#reviewdata['rating'].append(q['rating'])
reviewdata['datetime'].append(q['time_created'])
#result['useful'].append(q[''][''])
#result['funny'].append(q[''][''])
#result['cool'].append(q[''][''])
reviewdata = pd.DataFrame(reviewdata)
reviewdata.to_csv('yelp_review.csv', sep=',', header=True, index=True,float_format = str)
reviewpath = "yelp_reviews.csv"
reviews = pd.read_csv(reviewpath, index_col=0)
reviews.head()
| | user_id | user_name | datetime | business_id | business_name | text |
|---|---|---|---|---|---|---|
| num | ||||||
| 0 | fTqlWcqiFIVNrfXbF6C2mw | Peter W. | 30/4/2022 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | Charming location, charming service, delicious... |
| 1 | d_TBs6J3twMy9GChqUEXkg | Jennifer O. | 26/6/2021 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | I ate here precovid, I tried ir br ket s wich... |
| 2 | 8DwIFAcbzhwnZdd7LC_JuQ | Brad D. | 24/6/2022 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | Just Perfect Food w h a smile. What else coul... |
| 3 | lWcyfDKDlHSk3yclJ-tkiw | Mark G. | 9/8/2019 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | Yummy healthy food. Great soups. Vegan, Ve a... |
| 4 | LfgC6aypR9dnH6oSceMP9g | JI X. | 17/9/2019 | jO5EkqNn6IiypNfAAr_WfA | Green Bench Café | Th st s wich I've ever had. Although 's... |
import re
import pandas as pd
import nltk
import collections
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
corpus = []
for ids in top_bussiness_id:
corpus.append(reviews['text'][reviews['business_id']==ids].tolist())
# concatenate each business's reviews into one document per business
# (a space separator keeps words at review boundaries from fusing together)
docs = [' '.join(texts) for texts in corpus]
stopwordss = set(stopwords.words('english'))
# extra high-frequency words to drop on top of NLTK's stopword list
stopwordsss = {'the','it','is','you','and','have','get','this','also','are','to','be','was','little'}
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
# the \w+ tokenizer already strips punctuation, so no symbol list is needed;
# filtering with a list comprehension avoids the remove-while-iterating bug
# of calling doc.remove(word) inside "for word in doc", which skips elements
docs4 = []
for doc in docs:
    tokens = tokenizer.tokenize(doc.lower())
    docs4.append([w for w in tokens
                  if not w.isnumeric()
                  and w not in stopwordss
                  and w not in stopwordsss])
# join each business's tokens back into one comma-separated string for the word cloud
corpus = [','.join(tokens) for tokens in docs4]
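`collections` is imported above but never used; `collections.Counter` is a direct way to pull out each business's most frequent tokens alongside the word cloud. A sketch on toy tokens:

```python
from collections import Counter

# toy token list standing in for one business's cleaned review tokens
tokens = ['drink', 'bar', 'drink', 'great', 'service', 'bar', 'drink']
top_words = Counter(tokens).most_common(2)  # the 2 most frequent tokens
print(top_words)  # [('drink', 3), ('bar', 2)]
```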
1) Take 'The Bank on College Green' as an example:
we can easily see that its feature words are 'drink', 'dinner' and 'bar', along with 'great service' & 'atmosphere'.
top_bussiness.loc[top_bussiness['name']=='The Bank on College Green']
| | id | name | category | price | rating | reviews |
|---|---|---|---|---|---|---|
| 952 | dlClCiMV4Y8yTc9vCUCABw | The Bank on College Green | pub | 2.0 | 4.5 | 369 |
#!pip install wordcloud
from wordcloud import WordCloud
from wordcloud import ImageColorGenerator
wordcloud = WordCloud( background_color="white").generate(corpus[2])
plt.figure( figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
2) Take 'Queen of Tarts' as the other example:
its feature words are 'breakfast', 'cake' and 'coffee', with a 'sweet' & 'delicious' taste, which is completely different from 'The Bank on College Green'.
top_bussiness.loc[top_bussiness['name']=='Queen of Tarts']
| | id | name | category | price | rating | reviews |
|---|---|---|---|---|---|---|
| 2 | rKvPQZcgjrQOLRU0phPoAQ | Queen of Tarts | dessert | 2.0 | 4.5 | 511 |
wordcloud = WordCloud( background_color="white").generate(corpus[1])
plt.figure( figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In this task,
I gained a preliminary understanding of the geographical distribution, cuisine types, prices, etc. of Dublin's restaurants and bars;
I screened out 10 popular, highly rated restaurants and bars and added them to my food list, planning to try them together with friends;
By analysing the text of user reviews, I got a general sense of each restaurant's character: some serve delicious desserts, some are good bars, and some offer a great dinner atmosphere. I will gradually try these 10 restaurants according to my own needs.
Two challenges appeared in the process of using the Yelp API:
1) The Yelp business search returns at most 1,000 results, which means I had to repeat the request multiple times (with different query parameters) to obtain all the restaurants in the city;
2) The Yelp reviews endpoint returns a fixed 3 reviews per business, and three reviews are not enough for natural language analysis. After searching on Google, I instead used Beautiful Soup to scrape every review from the Yelp website. This was far from ideal: first, a new scraping task consumes a lot of time, and if I started over I would choose a project whose data can be obtained directly from an API; second, the data obtained by scraping is not as complete as the API's. These are two very different data acquisition methods, and mixing them may not suit very rigorous data analysis. For example, one business records 300 reviews in Yelp's business data, but the scraper only captured 286 of them.
At present, I found that Yelp offers complete offline datasets for download and analysis, which would solve the problem that complete data cannot be obtained from the API. However, downloading an offline dataset does not meet the requirements of this assignment.
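The 1,000-result cap is usually handled by paging with the search endpoint's `offset` parameter. A hedged sketch of such a paging loop, with a stub `fake_page` standing in for the real `requests.get` call against the Yelp API:

```python
def fetch_all(fetch_page, page_size=50, hard_cap=1000):
    """Collect results page by page until the source runs out or hits its cap."""
    results = []
    offset = 0
    while offset < hard_cap:
        page = fetch_page(offset=offset, limit=page_size)
        if not page:          # empty page: no more results
            break
        results.extend(page)
        offset += page_size
    return results

# stub standing in for a real API call: 120 fake businesses, up to 50 per page
def fake_page(offset, limit):
    return ['biz_%d' % i for i in range(offset, min(offset + limit, 120))]

print(len(fetch_all(fake_page)))  # 120
```

To exceed the cap itself, the query would additionally need to be split (e.g. by category or area) so each sub-query stays under 1,000 results.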
If there is a chance to continue later,
I could do more in-depth text analysis on the complete, massive offline dataset, such as finding the keywords in people's positive and negative reviews of a restaurant, so as to know which strengths the restaurant needs to maintain and which weaknesses it needs to improve.
I've had several favourite restaurants (with delicious food) go out of business due to poor management (low publicity, poor service, poor location, etc.). This makes me very sad, because it is so hard to find replacement restaurants. If I come across a restaurant I like very much in Dublin, I may start from this angle and give the restaurant some management advice, hoping they can keep their good food and operate for a long time.